This is a chapter about exploratory data analysis (EDA) of raw
dataset “ordinary exams results” which has already performed data
cleaning. The goal is to discover/summarize the main characteristics of
the data set, and the approach is data visualization such as:
1. Distribution of course numbers across faculties (pie chart)
2. Histogram of the course size for all courses (histogram)
3. Percentage of courses across departments (bar chart)
4. Average grade by departments over time (line chart/run chart)
# Load packages
library(graphics)
library(plotly)
library(dplyr) # data manipulation
library(forcats)
library(ggplot2) # visualization
library('DescTools')
library(tidyverse)
ordinary <- read.csv("ordinary.csv")
ordinary$FakultetName <- as.factor(ordinary$FakultetName)
ordinary$Termin <- as.factor(ordinary$Termin) # important
faculty_frequency <- table(ordinary$FakultetName)
names(dimnames(faculty_frequency)) <- 'faculty'
mydata <- as.data.frame(faculty_frequency)
fig_f <- mydata |>
plot_ly(labels = ~faculty,
values = ~Freq,
textposition = "inside",
textinfo = "label+percent",
insidetextorientation="tangential",
type = "pie") |>
layout(title = "Faculty",
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig_f
Among the faculties, “Det Natur- og Biovidenskabebelige Fakultet” has the most records/exams collected in this study – accounts for 33.8% of the total records, while “Det Teologiske Fakultet” has the least records/exams – only 1% of the total records.
Then, we use a histogram to represent the frequency distribution of our data observation of variable “Ord_antal_tilmeldte” (course size).
hist(ordinary$Ord_antal_tilmeldte,breaks = 30, main = "Histogram of size of course", xlab = "Size of courses",
ylim = c(0,6700))
From the statistic plot above, we can see that those small courses (number of students in ordinary exam less or equal to 20) takes the major part (around 6000) of the total courses, while the proportion of the large courses with more than 200 students in ordinary exam is relatively small. The course with the largest number of students for the ordinary exam is the 2020 winter course of the faculty of law “Strafferet og Straffeproces”, with approximately 800 students.
On the other hand, we can also see that the histogram plot is strongly right skewed. Then, we log-transform the variable” Ord_antal_tilmeldte” (= size of courses) to see if the skewness can be reduced.
hist(log(ordinary$Ord_antal_tilmeldte),breaks = 10, main = "Histogram of log-transformed size of course", xlab = "log-transformed size of courses", ylim = c(0,3300))
Now we can see the log-transformed course size is less skewed, thus, the log(Ord_antal_tilmeldte) will be used as a variable of interest in the model instead of “Ord_antal_tilmeldte”.
Returning to the difference in course size, we divide all courses into three groups, namely course size < 20, 20 ~ 200, and > 200.
# department percentage histogram stratified by course size ----
ordinary$Institut <- as.factor(ordinary$Institut)
department_frequency <- table(ordinary$Institut)
#small_course_size(<20)
department_frequency1 <- table(ordinary$Institut[ordinary$Ord_antal_tilmeldte<20])
#medium_course_size(20~200)
department_frequency2 <- table(ordinary$Institut[(ordinary$Ord_antal_tilmeldte>=20)&
(ordinary$Ord_antal_tilmeldte<200)])
#large_course_size(>200)
department_frequency3 <- table(ordinary$Institut[ordinary$Ord_antal_tilmeldte>=200])
# transform above tables into a dataframe
mydata <- as.data.frame(department_frequency)
mydata1 <- as.data.frame(department_frequency1)
mydata2 <- as.data.frame(department_frequency2)
mydata3 <- as.data.frame(department_frequency3)
names(mydata) <- c("department","freq") # replace columns names
mydata$freq1<- mydata1$Freq # add a new column with specified name
mydata$freq2<- mydata2$Freq
mydata$freq3<- mydata3$Freq
#calculate the percentage of each item to the total number
total <- sum(mydata$freq) # total number of courses
mydata$percentage <- mydata$freq/total*100
mydata$per1 <- mydata$freq1/total*100
mydata$per2 <- mydata$freq2/total*100
mydata$per3 <- mydata$freq3/total*100
mydata <- mydata[order(mydata$freq,decreasing = TRUE),] # order department by freq value, largest to smallest
mydata_long <- mydata %>% # pivot to long
pivot_longer(cols = c('per1','per2','per3'), names_to = c('.value','feature')
,names_pattern = "(.{3})(.)") # ".":any alphabet or num; "{}":the number of "."
ggplot(data=mydata_long, aes(x=fct_inorder(department), y=per, fill=feature)) +
geom_bar(stat="identity")+ xlab("department")+ylab("percentage (%)")+
theme(axis.text.x = element_text(angle = 80, hjust = 1))+
scale_fill_discrete(labels = c("<20", "20~200", ">200"))+
guides(fill = guide_legend(title = "Course size")) +
ggtitle("Percentage of the course by different departments")
If we look at the records of each department (see below figure), “Institut for Geovidenskab og Natureforvaltning” (belongs to “Det Natur- og Biovidenskabebelige Fakultet”) has the most records/exams while “neuroscience” has the least records. With stratification by course size, we can see the percentage of each department with different course size group, i.e., course size <20, 20~200, and >200. Most large courses (>200) are in “det sundhedsvidenskabelig facultet”, while the small size (<20) classes are mainly concentrated in the humanities, such as the “Institut for Engelsk, Germansk og Romansk” and the ” Institut for Tværkulturelle og regionale studier”.
Then we plot a line chart to represent the average grade trends
across different departments over time.
#```{r,eval = FALSE}
# calculate (average grade by department over time) and write in data frame as_model
for(i in c(1:nrow(ordinary))){
# number of students for each grade point
grade_pass <- c(ordinary$ordinary_result_12[i],
ordinary$ordinary_result_10[i],
ordinary$ordinary_result_7[i],
ordinary$ordinary_result_4[i],
ordinary$ordinary_result_02[i])
# average grade for each observation
ordinary$ave_raw[i] <- sum((c(12,10,7,4,2)*grade_pass))/sum(grade_pass)
}
as_model <- data.frame(matrix(ncol = 3,nrow = 0)) # initialize an empty data frame
for(i in levels(ordinary$Institut)){ # go through all departments
for (j in levels(ordinary$Termin)[1:10]){ # go through all semester (18vinter-22summer)
# extract all observations average grade for specified department and semester
inst <- ordinary$ave_raw[ordinary$Institut == i & ordinary$Termin==j]
# average of all the average grades
mean_ave <- mean(inst) # average grade for one specified institut*termin
inst_l <- i # department col
ter_l <- j # termin col
ll <- list(inst_l,ter_l, mean_ave)
as_model <- rbind(as_model, ll)
}
}
names(as_model) <- c("institut","termin","average")
as_model$institut <- factor(as_model$institut)
as_model$termin <- factor(as_model$termin)
#print(length(ordinary$Termin))
# plot
# adjust margins size
par(mar=c(5, 4, 4, 8.3), xpd=TRUE)
xtick <- 1:10
# sorted from high to low by average grade of termin 22.5 (used as legend)
aa <- order(as_model$average[as_model$termin=="22.5"],decreasing = T)
pp <- aa %% 26
for (i in aa){
ind <- as_model$institut == levels(as_model$institut)[i] #index
y <- as_model$average[ind]
ind <- !is.nan(y)
if (i == aa[1]){
plot((1:10)[ind], y[ind], type = "b", col = i ,
ylim=c(6,11),xlim=c(1,10),axes=F,
pch = pp[i], xlab="Termin", ylab="Average Grade", xaxt="n")
}
else{
lines((1:10)[ind], y[ind], type = "b", lwd=1, col = i, pch=pp[i] )
}
}
axis(side=1, at = xtick, labels=levels(ordinary$Termin)[1:10])
axis(side=2)
title(main = "Average grade by deparmetment over time")
deppp_legend <- levels(as_model$institut)
legend("topright",legend=deppp_legend[aa],box.lty=0,col=aa,inset=c(-0.525,0.1),
pch=pp[aa], cex=0.7)
The plot shows the department for “Klinisk Medicin” has the highest average grade (around 10.5), while the department for “Odontologi” and the department for “Farmaci” have the lowest average grade (around 7).